{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2024 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAPoU8Sm5E6e"
      },
      "source": [
        "# Qwen 3 evaluation - Bring your own data eval\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/open-models/evaluation/evaluate_open_models_byod.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fopen-models%2Fevaluation%2Fevaluate_open_models_byod.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/open-models/evaluation/evaluate_open_models_byod.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/evaluation/evaluate_open_models_byod.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/evaluation/evaluate_open_models_byod.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/evaluation/evaluate_open_models_byod.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/evaluation/evaluate_open_models_byod.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/evaluation/evaluate_open_models_byod.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/evaluation/evaluate_open_models_byod.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>|"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "## Overview\n",
        "\n",
        "This notebook demonstrates how to use the Vertex AI Evaluation service to compare the performance of a fine-tuned model against its base model on a custom dataset. This \"bring your own data\" approach is essential for assessing models on tasks that are specific to your domain.\n",
        "\n",
        "You will learn how to:\n",
        "* **Set up your environment**: Install the necessary libraries and authenticate with Google Cloud.\n",
        "* **Construct an evaluation dataset**: Create a `pandas.DataFrame` containing your prompts and the corresponding responses from two different models (a \"candidate\" and a \"baseline\").\n",
        "* **Configure a pairwise evaluation task**: Use the `EvalTask` class from the Vertex AI SDK to set up a `pairwise_summarization_quality` evaluation.\n",
        "* **Execute and analyze results**: Run the evaluation, which uses a powerful LLM as an \"auto-rater\" to compare the responses, and interpret the win/loss/tie metrics."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "61RBz8LLbxCR"
      },
      "source": [
        "## Get started"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "No17Cw5hgx12"
      },
      "source": [
        "### Install Vertex AI SDK for Python and other required packages\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tFy3H3aPgx12"
      },
      "outputs": [],
      "source": [
        "%pip install --upgrade --quiet google-cloud-aiplatform[evaluation]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dmWOrTJ3gx13"
      },
      "source": [
        "### Authenticate your notebook environment (Colab only)\n",
        "\n",
        "Authenticate your environment on Google Colab.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NyKGtVQjgx13"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "\n",
        "    from google.colab import auth\n",
        "\n",
        "    auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DF4l8DTdWgPY"
      },
      "source": [
        "### Set Google Cloud project information and initialize Vertex AI SDK for Python\n",
        "\n",
        "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Nqwi-5ufWp_B"
      },
      "outputs": [],
      "source": [
        "PROJECT_ID = \"[your-project-id]\"  # @param {type:\"string\"}\n",
        "LOCATION = \"us-central1\"  # @param {type:\"string\"}\n",
        "\n",
        "\n",
        "import vertexai\n",
        "\n",
        "vertexai.init(project=PROJECT_ID, location=LOCATION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1ioxwfhXiGxB"
      },
      "source": [
        "### Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZgXVDiILiKhZ"
      },
      "outputs": [],
      "source": [
        "from google.cloud import aiplatform\n",
        "import pandas as pd\n",
        "from vertexai.evaluation import EvalTask"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EdvJRUWRNGHE"
      },
      "source": [
        "## Pairwise comparison for Qwen3 (base vs finetuned)\n",
        "\n",
        "Now we get to the core of our tutorial. We're going to set up an evaluation to compare two sets of model responses for a medical note summarization task.\n",
        "* **Candidate Model**: A fine-tuned model (`qwen_tuned_responses`) that has been specialized for this task. Note the use of precise medical terminology.\n",
        "* **Baseline Model**: The original, general-purpose base model (`qwen_responses`). Note that its summaries are simpler and less technical.\n",
        "\n",
        "Our goal is to quantitatively determine if the fine-tuning resulted in a better model for this specific use case."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1mM0H8DbkYkp"
      },
      "source": [
        "### Construct the Evaluation Dataset\n",
        "\n",
        "The first step is to gather our data into a structured format. The Vertex AI Evaluation service seamlessly integrates with `pandas.DataFrame`. We will create a DataFrame with three essential columns:\n",
        "\n",
        "* `prompt`: The input text we want the models to summarize.\n",
        "* `response`: The output from our **candidate** model (the one we are testing, in this case, the fine-tuned model).\n",
        "* `baseline_model_response`: The output from the model we are comparing against.\n",
        "\n",
        "This structure is fundamental to pairwise evaluation. For each row, the evaluator will be asked: \"Given this prompt, is `response` or `baseline_model_response` the better summary?\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WzL0IEjWhtmt"
      },
      "outputs": [],
      "source": [
        "prompts = [\n",
        "    \"Please summarize the following patient note: 72-year-old female with a past medical history of atrial fibrillation (not on anticoagulation) presented to the ED with a sudden onset of left-sided weakness and facial droop, which began 2 hours ago. Non-contrast CT head is negative for any acute bleed. NIH Stroke Scale is 12. Patient is a candidate for tPA. Neurology is on board.\",\n",
        "    \"Please summarize the following patient note: 65-year-old male with a 40-pack-year smoking history and known COPD presents with a 3-day history of increased shortness of breath, wheezing, and productive cough with yellow sputum. On exam, he is in mild respiratory distress with diffuse expiratory wheezes. ABG shows mild respiratory acidosis. He was started on Duonebs, IV Solu-Medrol, and Levaquin.\",\n",
        "    \"Please summarize the following patient note: 45-year-old male with a history of alcohol abuse presents with severe, constant epigastric pain radiating to the back, associated with nausea and vomiting. Labs are significant for lipase of 3500 U/L. CT abdomen shows peripancreatic fat stranding consistent with acute pancreatitis. Patient is made NPO and started on aggressive IV fluid hydration and pain management.\",\n",
        "    \"Please summarize the following patient note: 88-year-old female from a nursing home who tripped and fell, now with severe right hip pain and inability to bear weight. Physical exam reveals a shortened and externally rotated right leg. X-ray of the pelvis confirms a displaced right femoral neck fracture. Orthopedics has been consulted for surgical fixation.\",\n",
        "    \"Please summarize the following patient note: 19-year-old female with Type 1 Diabetes Mellitus presents with a 1-day history of polyuria, polydipsia, and vomiting. Fingerstick glucose is 'high' (>500 mg/dL). VBG shows a pH of 7.15 and bicarbonate of 10 mEq/L. Urine ketones are large. She is being admitted to the ICU for management of diabetic ketoacidosis with an insulin drip and fluid resuscitation.\",\n",
        "    \"Please summarize the following patient note: 78-year-old male with an indwelling Foley catheter presents from a skilled nursing facility with fever, confusion, and hypotension (BP 85/45). Labs show leukocytosis of 18,000 and a lactic acid of 4.2 mmol/L. Urinalysis is positive for nitrates and leukocyte esterase. Diagnosis is septic shock secondary to a UTI. He received a fluid bolus and was started on broad-spectrum antibiotics (Zosyn).\",\n",
        "    \"Please summarize the following patient note: A 2-year-old female with a URI for 2 days developed a fever of 104°F and experienced a 2-minute generalized tonic-clonic seizure. In the ED, she is postictal but arousable. Workup is consistent with a simple febrile seizure. Parents were educated on fever control and seizure precautions.\",\n",
        "    \"Please summarize the following patient note: 34-year-old female with a history of major depressive disorder, non-compliant with her SSRI, is brought in by family due to worsening depression, anhedonia, and active suicidal ideation with a plan. She is being admitted to the inpatient psychiatric unit on a 1:1 observation for safety and medication stabilization.\",\n",
        "    \"Please summarize the following patient note: 31-year-old female, G1P0 at 34 weeks gestation, presents for a routine check-up and is found to have a blood pressure of 165/110. She also reports headaches and seeing 'spots'. Urine dipstick shows 3+ proteinuria. The patient is being admitted for management of pre-eclampsia with severe features, and will be started on magnesium sulfate for seizure prophylaxis and labetalol for BP control.\",\n",
        "    \"Please summarize the following patient note: 68-year-old male presents with palpitations and lightheadedness. EKG shows atrial fibrillation with a rapid ventricular response at a rate of 140 bpm. Blood pressure is stable at 110/70. The patient was given a 20mg IV bolus of Cardizem, which converted him to normal sinus rhythm at 80 bpm. He will be started on an Eliquis for stroke prophylaxis.\",\n",
        "]\n",
        "\n",
        "qwen_tuned_responses = [\n",
        "    \"A 72-year-old female with untreated atrial fibrillation is presenting with an acute ischemic stroke within the tPA window. A non-contrast CT head was negative for hemorrhage and Neurology is proceeding with thrombolytic therapy.\",\n",
        "    \"A 65-year-old male with a significant smoking history is being treated for an acute exacerbation of COPD, presenting with respiratory distress and mild respiratory acidosis. Management includes bronchodilators, systemic steroids, and antibiotics.\",\n",
        "    \"A 45-year-old male with a history of alcohol abuse is diagnosed with acute pancreatitis, confirmed by elevated lipase and CT findings. He is being managed supportively with NPO status, IV fluids, and analgesics.\",\n",
        "    \"An 88-year-old female sustained a displaced right femoral neck fracture after a fall. She is being evaluated by Orthopedics for surgical intervention.\",\n",
        "    \"A 19-year-old female with T1DM is being admitted to the ICU for treatment of severe diabetic ketoacidosis (DKA), requiring an insulin infusion and aggressive fluid resuscitation.\",\n",
        "    \"A 78-year-old male is in septic shock secondary to a urosepsis, presenting with hypotension, confusion, and significant lactic acidosis. He is being resuscitated with IV fluids and broad-spectrum antibiotics.\",\n",
        "    \"A 2-year-old female presented with a simple febrile seizure in the setting of a URI. She is clinically stable and the family was educated on supportive care.\",\n",
        "    \"A 34-year-old female with a history of MDD is being admitted to inpatient psychiatry for acute suicidal ideation, requiring close observation and medication management.\",\n",
        "    \"A 31-year-old female at 34 weeks gestation is being admitted for pre-eclampsia with severe features. She is being started on magnesium sulfate for seizure prophylaxis and antihypertensive therapy.\",\n",
        "    \"A 68-year-old male presented with atrial fibrillation with RVR and was successfully cardioverted with diltiazem (Cardizem). He will be initiated on apixaban (Eliquis) for anticoagulation.\",\n",
        "]\n",
        "\n",
        "qwen_responses = [\n",
        "    \"A 72-year-old woman came to the hospital because her left side was weak. A CT scan of her head was normal. Her stroke score is 12. Doctors are giving her a special medicine for the stroke.\",\n",
        "    \"A 65-year-old man who smokes a lot is having trouble breathing. He has a cough and is wheezing. Doctors gave him some medicine to help him breathe and an antibiotic.\",\n",
        "    \"A 45-year-old man who drinks alcohol has bad stomach pain that goes to his back. His lab tests were very high. A CT scan showed a problem with his pancreas. He can't eat and is getting IV fluids and pain medicine.\",\n",
        "    \"An 88-year-old woman fell and hurt her hip. She can't walk on it. Her right leg looks shorter. An x-ray showed her hip is broken. A bone doctor will see her.\",\n",
        "    \"A 19-year-old woman with diabetes is sick and throwing up. Her blood sugar is very high and her blood is acidic. She is going to the ICU to get insulin and fluids.\",\n",
        "    \"An old man from a nursing home is confused and has a low blood pressure. His white blood cell count is high and he has a high lactic acid. The doctors think he has an infection from his urine and are giving him fluids and antibiotics.\",\n",
        "    \"A 2-year-old girl had a seizure that lasted for 2 minutes because she had a high fever. She is awake now. The doctors think it was a febrile seizure and told her parents what to do.\",\n",
        "    \"A 34-year-old woman is very depressed and wants to harm herself. She stopped taking her medicine. She is going to a psychiatric hospital so they can keep her safe and help with her medicine.\",\n",
        "    \"A 31-year-old pregnant woman has very high blood pressure. She has a headache and is seeing spots. She has protein in her urine. She is going to the hospital to get medicine to prevent seizures and lower her blood pressure.\",\n",
        "    \"A 68-year-old man's heart was beating very fast and irregularly. He felt dizzy. They gave him a medicine called Cardizem through an IV and his heart went back to normal. He will start a new medicine to prevent strokes.\",\n",
        "]\n",
        "\n",
        "eval_dataset = pd.DataFrame(\n",
        "    {\n",
        "        \"prompt\": prompts,\n",
        "        \"response\": qwen_tuned_responses,\n",
        "        \"baseline_model_response\": qwen_responses,\n",
        "    }\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FrEsxxAmlhcA"
      },
      "outputs": [],
      "source": [
        "eval_dataset.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W7pwueCBifIC"
      },
      "source": [
        "### Define the Evaluation Task\n",
        "\n",
        "With our dataset prepared, we now define the evaluation job using the `EvalTask` class. This object encapsulates all the information needed to run the evaluation on the Vertex AI service.\n",
        "\n",
        "* `dataset`: The DataFrame we just created.\n",
        "* `metrics`: A list of metrics to compute. We use `\"pairwise_summarization_quality\"`. This tells Vertex AI to use its model-based evaluator, configured with specific instructions for judging summary quality (e.g., coherence, accuracy, conciseness).\n",
        "* `experiment`: The name of the Vertex AI Experiment to log this run under. Using experiments is a key MLOps practice, helping you organize, track, and compare different evaluation runs over time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fm4xanSWifVb"
      },
      "outputs": [],
      "source": [
        "EXPERIMENT_NAME = \"eval-qwen3\"\n",
        "\n",
        "summarization_eval_task = EvalTask(\n",
        "    dataset=eval_dataset,\n",
        "    metrics=[\"pairwise_summarization_quality\"],\n",
        "    experiment=EXPERIMENT_NAME,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vr2wzWZ-j9KR"
      },
      "source": [
        "### Run your evaluation\n",
        "\n",
        "This is where the magic happens. Calling `.evaluate()` sends our dataset and configuration to the Vertex AI backend. A scalable, serverless job spins up, and for each row in our DataFrame, a powerful foundation model (the \"auto-rater\") reads the prompt and both responses and renders a judgment.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JnWDqgqmj9Yc"
      },
      "outputs": [],
      "source": [
        "eval_result = summarization_eval_task.evaluate()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "89taehcWke7S"
      },
      "source": [
        "### Analyze the Results\n",
        "\n",
        "The `evaluate()` method returns a result object containing the aggregated metrics. The `.metrics_table` provides a clear, concise `pandas.DataFrame` summarizing the outcome."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7WLQDRsfkKl1"
      },
      "outputs": [],
      "source": [
        "eval_result.metrics_table"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2a4e033321ad"
      },
      "source": [
        "## Cleaning up"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "epaxAAxnkmHk"
      },
      "outputs": [],
      "source": [
        "delete_experiment = False\n",
        "\n",
        "if delete_experiment:\n",
        "    experiment = aiplatform.Experiment(EXPERIMENT_NAME)\n",
        "    experiment.delete()"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "evaluate_open_models_byod.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
